ArkDTA: attention regularization guided by non-covalent interactions for explainable drug–target binding affinity prediction

Abstract
Motivation: Protein–ligand binding affinity prediction is a central task in drug design and development. The cross-modal attention mechanism has recently become a core component of many deep learning models due to its potential to improve model explainability. Non-covalent interactions (NCIs), among the most critical pieces of domain knowledge in the binding affinity prediction task, should be incorporated into the protein–ligand attention mechanism for more explainable deep drug–target interaction models. We propose ArkDTA, a novel deep neural architecture for explainable binding affinity prediction guided by NCIs.
Results: Experimental results show that ArkDTA achieves predictive performance comparable to current state-of-the-art models while significantly improving model explainability. Qualitative investigation into our novel attention mechanism reveals that ArkDTA can identify potential regions for NCIs between candidate drug compounds and target proteins, and can guide the internal operations of the model in a more interpretable and domain-aware manner.
Availability: ArkDTA is available at https://github.com/dmis-lab/ArkDTA
Contact: kangj@korea.ac.kr


Introduction
Identification of drug-target interactions (DTIs) is a central task in drug design and development. Due to the costly and labor-intensive nature of traditional drug development process based on in vivo and in vitro experiments, deep learning models for protein-ligand binding affinity prediction have gained recognition (Chen et al. 2018). However, limited model explainability remains an obstacle to the adoption of such models by domain experts (Preuer et al. 2019). With the unique ethical and regulatory requirements, there is a growing demand for interpretable deep models in the field of biomedicine. In recent works, attention-based methods were studied to address the issue of explainability (Liang et al. 2021). Critical domain knowledge should be integrated to ensure that the model's implicit assumptions are compatible with expert opinions (Dash et al. 2022).
In DTI, one such key concept is that of protein-ligand noncovalent interactions (NCIs). NCIs are essential for understanding how proteins and ligands interact and form complexes with each other, which affects the mechanism of action of drug compounds (Tang et al. 2017; Chen et al. 2019; Anighoro 2020; Aljoundi et al. 2020). Most drug compounds are small organic molecules that act as ligands and interact with proteins to carry out their functions. The majority of drugs deliver their effects by forming noncovalent bonds with their biological targets. NCIs induce conformational changes in target proteins, which influence the overall binding affinity and are crucial for the stabilization of the protein-ligand complex in its final form (Davis and Phipps 2017; Aljoundi et al. 2020).
Despite being highlighted as a fundamental concept in the protein-ligand affinity prediction task, protein-ligand NCIs have been addressed by few studies. While MONN (Li et al. 2020) explicitly utilized NCIs in its auxiliary task, the resulting pairwise interaction matrix between all protein residues and all ligand atoms is limited in its capacity to differentiate active and inactive binding sites. On the other hand, AttentionDTA (Zhao et al. 2022a) sought to distinguish active and inactive residues using parameterized weights in its attention mechanism without explicitly using NCI labels. However, its attention mechanism is based on the convolved features from each of its modality-wise encoder modules. We address the limitations of both previous works by utilizing NCI labels in the attention mechanism to identify active protein residues.
We present ArkDTA, an explainable deep DTI prediction model with NCI-aware attention regularization. Taking as input a set of protein residues and a set of ligand substructures, our novel regularization method modulates the distribution of cross-modal attention weights from the residues to chemical substructures in a manner that allows a distinction between active and inactive residues. Our modified cross-modal attention module appends a pseudo-substructure embedding to the set of key chemical substructures and focuses the attention on the pseudo embedding where the query protein residue is inactive. Examining the final attention weights yields qualitative insights into the model's internal operations.
Experimental results on three benchmark datasets reveal that ArkDTA achieves predictive performance comparable to the current state-of-the-art models while significantly improving model explainability. Qualitative investigation into the attention maps demonstrates our model's ability to identify NCI-forming regions in seen and unseen protein-ligand complexes as well as highlight chemical substructures commonly used as pharmaceutical agents.

Dataset
Three benchmark datasets were used to conduct experiments on ArkDTA and the baseline models: PDBbind version 2020 (PDBbind), Davis et al. (Davis), and Metz et al. (Metz) (Davis et al. 2011; Metz et al. 2011; Liu et al. 2017). The $i$th data instance $X_i$ in each of these datasets consists of a protein-ligand pair with its binding affinity score, expressed in one of the following measurement types: inhibition constant ($K_i$), dissociation constant ($K_d$), and inhibitory concentration 50 (IC50).
For the purpose of this study, PDBbind was sub-divided into two subsets according to the binding affinity measurement type. The KIKD subset consists of all protein-ligand instances whose binding affinity scores are expressed as a $K_i$ or $K_d$ value, and the remaining instances, whose affinity scores are expressed as an IC50 value, were combined to form the IC50 subset. The Davis and Metz datasets contain protein-ligand pairs with only KIKD-based affinity scores.
We applied several data curation methods to our constructed datasets. For each dataset, protein-ligand pairs were excluded if the number of amino acids in the protein sequence exceeds 1000 or if the exact affinity value is unavailable (e.g. expressed as inequality ">50000 M"). We then normalized the binding affinity scores in each dataset into values in unit "M" and subsequently transformed them into log space for consistent comparison (Öztürk et al. 2018). Table 1 shows the total number of proteins, ligands, and curated data instances in the two measurement types (KIKD, IC50) for each dataset. Overall, the $i$th data instance $X_i$ in the binding affinity datasets is defined as

$$X_i = (p_i, c_i, y_i)$$

where $p_i$, $c_i$, and $y_i \in \mathbb{R}$ are the input protein, ligand, and annotated binding affinity value, respectively.
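The curation rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record keys (`sequence`, `smiles`, `relation`, `value`, `unit`), the unit table, and the choice of a negative base-10 logarithm are hypothetical and are not taken from the ArkDTA code base.

```python
import math

# Conversion factors to molar (M); the unit labels are assumptions for illustration.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}

def curate(records, max_len=1000):
    """Filter raw records and return (sequence, smiles, log-affinity) tuples.

    Each record is a dict with hypothetical keys: 'sequence', 'smiles',
    'relation' ('=' for exact values, '>' or '<' for inequalities),
    'value', and 'unit'.
    """
    curated = []
    for rec in records:
        if len(rec["sequence"]) > max_len:
            continue  # protein sequence exceeds 1000 amino acids
        if rec["relation"] != "=":
            continue  # exact affinity value unavailable (inequality)
        molar = rec["value"] * UNIT_TO_MOLAR[rec["unit"]]
        # The negative log10 (pK-style) sign convention is an assumption.
        curated.append((rec["sequence"], rec["smiles"], -math.log10(molar)))
    return curated

# Toy records, invented purely to exercise the three rules.
records = [
    {"sequence": "MKT" * 10, "smiles": "CCO", "relation": "=", "value": 10.0, "unit": "nM"},
    {"sequence": "MKT" * 10, "smiles": "CCN", "relation": ">", "value": 50000.0, "unit": "nM"},
    {"sequence": "A" * 1001, "smiles": "CCC", "relation": "=", "value": 1.0, "unit": "uM"},
]
dataset = curate(records)
```

Only the first toy record survives: the second is an inequality and the third exceeds the length limit.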
To regularize ArkDTA's attention mechanism, we further augmented the preprocessed PDBbind dataset with NCI labels. We used the Protein-Ligand Interaction Profiler (PLIP) to extract the NCI labels from each binding complex structure contained in the original PDBbind dataset (Adasme et al. 2021). The NCI labels for each protein-ligand pair are represented as an $m \times n$ 2D binary matrix indicating the presence of any type of NCI (e.g. hydrogen bonding, salt bridges), where $m$ and $n$ are the numbers of amino acid residues in the protein and atoms in the ligand, respectively. Since the attention mechanism in ArkDTA is based on cross-modal interactions between the protein residues and chemical substructures, we converted the binary matrix into an $m$-dimensional binary vector $\vec{k}$ where each residue in the protein is labeled as 1 if it has an NCI with at least one atom in its ligand partner. We use $\vec{k}$ as the ground truth residue-wise NCI labels for attention regularization in ArkDTA. The augmented $i$th data instance $\tilde{X}_i$ in the PDBbind dataset is defined as

$$\tilde{X}_i = (p_i, c_i, y_i, \vec{k}_i)$$

where $\vec{k}_i \in \{0, 1\}^{m}$ is the converted binary vector indicating the presence of each residue's NCI with the input ligand $c_i$.
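The conversion from the residue-by-atom matrix to residue-wise labels amounts to a row-wise logical OR. A minimal sketch follows; the function name and the toy matrix are illustrative and do not come from PLIP or the ArkDTA repository.

```python
def residue_nci_labels(nci_matrix):
    """Collapse an m x n binary residue-by-atom NCI matrix into an
    m-dimensional binary vector: 1 if the residue has an NCI with at
    least one ligand atom, otherwise 0."""
    return [1 if any(row) else 0 for row in nci_matrix]

# Toy example with m = 4 residues and n = 3 ligand atoms.
nci_matrix = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
labels = residue_nci_labels(nci_matrix)
```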
Despite having the fewest data instances, our preprocessed PDBbind dataset is the primary dataset of this study since it contains the residue-wise NCI labels. We randomly partitioned the PDBbind dataset into five folds, where 5% of the training instances in each fold were used for validation. This split method yields an ensemble of five models trained on different folds of this dataset. The Davis and Metz datasets, used for fine-tuning purposes, were each partitioned into training and test instances (8:2), where 5% of the training instances were also used for validation. The purpose of these two datasets is to fine-tune each of the five models previously trained on the PDBbind dataset.
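One way to realize such a partition is sketched below. The shuffling seed, the way folds are cut, and the rounding of the 5% validation share are assumptions made for illustration only.

```python
import random

def five_fold_splits(num_instances, num_folds=5, val_fraction=0.05, seed=0):
    """Return one {'train', 'valid', 'test'} index split per fold."""
    indices = list(range(num_instances))
    random.Random(seed).shuffle(indices)
    # Each fold takes every num_folds-th shuffled index.
    folds = [indices[i::num_folds] for i in range(num_folds)]
    splits = []
    for k in range(num_folds):
        test = folds[k]
        train_pool = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        # Hold out 5% of the training instances for validation.
        num_val = max(1, int(len(train_pool) * val_fraction))
        splits.append({
            "train": train_pool[num_val:],
            "valid": train_pool[:num_val],
            "test": test,
        })
    return splits

splits = five_fold_splits(100)
```

Training one model per entry of `splits` gives the five models that are later fine-tuned on Davis and Metz.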

Overview
Our ArkDTA model consists of the 'Protein Encoder Module', 'Ligand Encoder Module', 'Protein-Ligand Integration Module', and 'Affinity Prediction Module'. The first two modules encode input data into protein residue-wise and chemical substructure-wise representations, respectively. The 'Protein-Ligand Integration Module' refines the residue-wise representations based on their attention weights given the substructure-wise representations and subsequently aggregates them into a single binding complex representation. Finally, the 'Affinity Prediction Module' takes the binding complex representation as input and predicts the binding affinity value as output. Formally, ArkDTA is defined as

$$\hat{y} = \mathrm{ArkDTA}(p, c)$$

where the inputs are the protein $p$ and ligand (compound) $c$, and the output is the predicted binding affinity value $\hat{y}$. Figure 1 shows the overall model architecture of ArkDTA.

Protein Encoder Module
The 'Protein Encoder Module' takes a protein $p$ as input and encodes it into a set of $m$ $d$-dimensional residue embeddings as output. The initial representation for the input protein is its FASTA sequence. While such 1D-based representations may have limitations in representing proteins, large-scale language models have been introduced to alleviate these issues (Rives et al. 2019). These models have shown promising results in protein structure and function prediction tasks. We imported a pre-trained protein language model called Evolutionary Scale Model (ESM) and its tokenizer to obtain residue embeddings (Rives et al. 2019; Lin et al. 2022). The tokenizer converts the input protein $p$ into a sequence of tokens that is subsequently fed to the ESM model. Finally, the ESM model converts the input tokens to a set of protein residue embeddings $R \in \mathbb{R}^{m \times d}$. The pre-trained weights used are those of ESM-2 (8M), which has 6 layers.

Ligand Encoder Module
The 'Ligand Encoder Module' takes a ligand $c$ as input and encodes it into a set of $n$ $d$-dimensional substructure embeddings as output. The initial representation for the input ligand is its SMILES string. The SMILES string is first converted into a Morgan fingerprint represented as a 1024-dimensional bit vector $\vec{f} \in \{0, 1\}^{1024}$. Each bit position in the vector indicates the presence of its corresponding chemical substructure. The 'Ligand Encoder Module' leverages this information by gathering the positional indices of that vector where the bit is 1 and uses a lookup table to obtain a set of trainable $d$-dimensional chemical substructure embeddings $S = \{s_1, s_2, \ldots, s_n \mid s_i \in \mathcal{S}\}$, where $n$ is the number of chemical substructures extracted from $\vec{f}$. $\mathcal{S}$ is a set of 1024 trainable chemical substructure embeddings, each of which corresponds to a bit position in the Morgan fingerprint.
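The lookup step can be illustrated as follows. This sketch assumes the fingerprint has already been computed, uses random vectors as stand-ins for the trainable table, and picks a tiny embedding size for readability; all names are hypothetical.

```python
import random

NUM_BITS = 1024   # Morgan fingerprint length stated in the text
EMBED_DIM = 4     # small d for illustration only

rng = random.Random(0)
# Lookup table with one vector per bit position. In the model these would be
# trainable parameters; random values stand in for them here.
embedding_table = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
                   for _ in range(NUM_BITS)]

def substructure_embeddings(fingerprint):
    """Gather the indices of the set bits and look up their embeddings."""
    on_bits = [i for i, bit in enumerate(fingerprint) if bit == 1]
    return on_bits, [embedding_table[i] for i in on_bits]

# Toy fingerprint with three substructures present.
fingerprint = [0] * NUM_BITS
for pos in (3, 80, 512):
    fingerprint[pos] = 1
on_bits, S = substructure_embeddings(fingerprint)
```

The number of returned embeddings equals the number of set bits, so $n$ varies from ligand to ligand.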

Protein-Ligand Integration Module
The 'Protein-Ligand Integration Module' consists of a Multihead Attention Block (MAB) and a Pooling Layer. The MAB refines an input set of protein residues R from 'Protein Encoder Module' based on our novel attention mechanism with another input set of chemical substructures S from 'Ligand Encoder Module'. The Pooling Layer subsequently aggregates the refined residue-wise embeddings into one single binding complex embedding. The MAB's operations reflect the conformational transitions proteins undergo when bound to a ligand, while the Pooling Layer's output corresponds to the final protein-ligand complex that determines the binding affinity value.
The MAB in 'Protein-Ligand Integration Module' employs multihead attention mechanism and produces a set of 'refined' residues given R and S as inputs (Vaswani et al. 2017;Lee et al. 2019). Following the definitions made by previous works, the MAB takes R, S, and S as queries, keys, and values, respectively.
Let $A \in \mathbb{R}^{m \times n}$ be the calculated attention weights between the query and key linear projections of $R$ and $S$, respectively. For each $i$th residue ($i \in \{1, 2, \ldots, m\}$), the attention weights are distributed across all $n$ corresponding chemical substructures, where $\sum_{j=1}^{n} A_{i,j} = 1$. However, as most residues do not form NCIs with the incoming ligand, it may be undesirable to utilize all calculated attention weights.
Our modified version of the MAB first appends a trainable universal $d$-dimensional pseudo-substructure embedding $\vec{p} \in \mathbb{R}^{d}$ to the current set of chemical substructure embeddings $S$. The main purpose is to regularize the attention between the query protein residues and key-value chemical substructures based on their NCIs. Specifically, we devised a strategy that makes the attention weights from non-binding query residues (i.e. residues having no NCIs with the ligand) skewed toward the key pseudo-substructure embedding. For binding query residues (i.e. residues having NCIs with the ligand), the attention weights are prevented from being skewed toward the pseudo-substructure and are instead distributed to the actual chemical substructure embeddings in an unsupervised fashion. We denote this modification as Attention Regularization based on NCIs in MAB (ARK-MAB).
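The effect of the appended pseudo-substructure on the attention distribution can be shown with a small single-head sketch. It starts from raw attention scores rather than embeddings, so the scoring function itself is left out; the score values are invented for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_pseudo(scores, pseudo_scores):
    """scores: m x n raw residue-to-substructure scores.
    pseudo_scores: length-m raw scores against the pseudo-substructure.
    Returns the m x (n + 1) attention matrix whose last column belongs
    to the pseudo-substructure key."""
    return [softmax(row + [p]) for row, p in zip(scores, pseudo_scores)]

# Residue 0 behaves like a binding residue, residue 1 like a non-binding one.
scores = [[2.0, 1.0, 0.5], [0.1, 0.0, -0.2]]
pseudo_scores = [-3.0, 4.0]
A_plus = attention_with_pseudo(scores, pseudo_scores)

# Attention mass left on the real substructures for each residue.
real_mass = [sum(row[:-1]) for row in A_plus]
```

Because each row is normalized over $n + 1$ keys, any weight placed on the pseudo-substructure is weight removed from the real substructures.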
The ARK-MAB takes $R$ and $S$ as input and outputs $R^{*} \in \mathbb{R}^{m \times d}$, a set of 'refined' residue embeddings. In its formulation, 'LayerNorm' is the layer-wise normalization method (Ba et al. 2016) and RFF is a row-wise feedforward layer. MultiAttn is a $k$-headed attention layer where $X$ is linearly projected to query vectors and $Y$ is linearly projected to both key and value vectors. For the calculation of attention weights, we employed Additive Attention, originally proposed by Bahdanau et al. (2014), and used four attention heads.
We adopted Pooling by Multihead Attention (PMA) from the Set Transformer framework (Lee et al. 2019) for the Pooling Layer. The $m$ refined residue embeddings $R^{*} \in \mathbb{R}^{m \times d}$ are aggregated based on a set of $u$ trainable seed vectors $U \in \mathbb{R}^{u \times d}$ into a set of $u$ aggregated residue embeddings $R^{a} \in \mathbb{R}^{u \times d}$. Following the explanation in the Set Transformer paper, the PMA layer is built on the MAB that takes $U$, $R^{*}$, and $R^{*}$ as queries, keys, and values, respectively. Subsequently, the aggregated residues are concatenated vector-wise and reduced to a $d$-dimensional binding complex embedding via a simple linear layer. The order of vector-wise concatenation is determined by the fixed order of the seed vectors $U$.
The PMA layer takes the refined residues $R^{*}$ as input and produces $C \in \mathbb{R}^{1 \times d}$, the binding complex embedding built from the vector-wise concatenation of the aggregated residues $\{r^{a}_{i} \in R^{a} \mid i = 1, 2, \ldots, u\}$. Linear is a linear layer without nonlinear activation that reduces the binding complex embedding's expanded dimension to $d$, where the weights and bias are $W_{\mathrm{linear}} \in \mathbb{R}^{d \cdot u \times d}$ and $b_{\mathrm{linear}} \in \mathbb{R}^{d}$, respectively. Figure 2 shows the detailed description of ARK-MAB and PMA.
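For orientation, the two blocks can be written out as below. This is a sketch assembled from the definitions above and the Set Transformer form of the MAB (Lee et al. 2019). The placement of the residual connections, the notation $S^{+}$ for the substructure set with the pseudo-embedding appended, and the bracket notation for concatenation are assumptions, not the authors' exact equations.

```latex
\begin{align*}
S^{+} &= S \cup \{\vec{p}\} \\
H &= \mathrm{LayerNorm}\big(R + \mathrm{MultiAttn}(R, S^{+})\big) \\
R^{*} &= \mathrm{LayerNorm}\big(H + \mathrm{RFF}(H)\big) \\
R^{a} &= \mathrm{MAB}(U, R^{*}) \\
C &= \mathrm{Linear}\big([\, r^{a}_{1};\, r^{a}_{2};\, \ldots;\, r^{a}_{u} \,]\big)
   = [\, r^{a}_{1};\, \ldots;\, r^{a}_{u} \,]\, W_{\mathrm{linear}} + b_{\mathrm{linear}}
\end{align*}
```

The shapes are consistent with the text: the concatenation is $1 \times (d \cdot u)$, $W_{\mathrm{linear}}$ is $(d \cdot u) \times d$, and $C$ is $1 \times d$.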

Affinity Prediction Module
The 'Affinity Prediction Module' takes the binding complex embedding $C$ as input and predicts the binding affinity score $\hat{y}$.

Training and optimization
The loss objective for training ArkDTA consists of two terms: the main and the auxiliary loss objective. The main loss objective is based on the root mean squared error (RMSE) between the binding affinity predictions and their corresponding ground truth values. The auxiliary loss objective was specially designed to impose regularization on the attention mechanism utilized in the 'Protein-Ligand Integration Module', using binary cross entropy as its criterion. The batch-wise main loss objective $L_1$ for binding affinity prediction is computed from $\hat{Y}$, a $b$-sized batch of predicted binding affinities, and $Y$, a $b$-sized batch of ground truth binding affinities, using MSE, the mean squared loss criterion for binding affinity prediction.
For the attention regularization described in the 'Protein-Ligand Integration Module', let $A^{+} \in \mathbb{R}^{m \times (n+1)}$ be the calculated attention weight matrix averaged head-wise, given the set of $m$ protein residue embeddings $R^{*} \in \mathbb{R}^{m \times d}$ as queries and the set of $n$ chemical substructure embeddings appended with the pseudo-embedding $\vec{p}$, $S \in \mathbb{R}^{(n+1) \times d}$, as keys. For each residue in $A^{+}$, the sum of the $n$ attention weights corresponding to the $n$ chemical substructures is the NCI score, deemed as the predicted class probability of having an NCI with the ligand compound. Conversely, the attention weight corresponding to the pseudo-substructure is deemed as the predicted class probability of having no such interaction. Figure 3 illustrates how the attention mechanism in the ARK-MAB works. If the $i$th residue does not have any NCI with the ligand, the ARK-MAB is guided to generate attention weights concentrated on the pseudo-substructure. In other words, the $i$th residue is expected to attend mostly to the pseudo-substructure $\vec{p}$ and relatively less to the actual chemical substructures. Conversely, if the residue has an actual NCI, the guided attention weights are distributed to the actual chemical substructures in an unsupervised fashion.
The batch-wise auxiliary loss objective $L_2$ for attention regularization is built on the attention matrices

$$A_i^{+} = (a_{rc})_{1 \le r \le m_i,\; 1 \le c \le n_i + 1} \qquad (17)$$

where $A$ is a $b$-sized batch of attention weight matrices averaged head-wise, $K$ is a batch of ground truth NCI labels, $A_i^{+}$ is the $i$th attention matrix whose numbers of rows and columns are $m_i$ and $n_i + 1$ respectively, $\hat{k}$ are the residue-wise predicted NCI probabilities obtained by summing $A_i^{+}$ across its columns except the last, $(n_i + 1)$th column, which corresponds to the pseudo-substructure $\vec{p}$, $\hat{K}$ is a batch of predicted NCI probabilities, and 'CrossEntropy' is the binary cross entropy loss criterion for attention regularization based on NCIs. For the $i$th instance in the batch, $m_i$ and $n_i$ are the number of residues in the protein and substructures in the ligand, respectively. Recall that $\vec{k}$ is an $m$-dimensional binary vector where each bit indicates whether the corresponding residue has an NCI with its ligand partner. Supplementary Figure S3 provides an illustrative description of the auxiliary loss objective $L_2$.
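The computation described above can be sketched in plain Python as follows. How the per-residue terms are reduced across proteins of different lengths is not stated in the text; averaging over all residues in the batch, and the clamping constant `eps`, are assumptions of this sketch.

```python
import math

def nci_probabilities(A_plus):
    """Sum each row of an m x (n + 1) attention matrix over all columns
    except the last one (the pseudo-substructure)."""
    return [sum(row[:-1]) for row in A_plus]

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross entropy; probabilities are clamped for stability."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

def auxiliary_loss(batch_A_plus, batch_labels):
    """Pool the residue-wise predictions of every instance in the batch
    and compare them with the ground truth NCI labels."""
    probs, labels = [], []
    for A_plus, k in zip(batch_A_plus, batch_labels):
        probs.extend(nci_probabilities(A_plus))
        labels.extend(k)
    return binary_cross_entropy(probs, labels)

# One toy instance with m = 2 residues and n = 2 substructures.
A_plus = [[0.5, 0.4, 0.1],     # most weight on real substructures
          [0.05, 0.05, 0.9]]   # most weight on the pseudo-substructure
loss_matching = auxiliary_loss([A_plus], [[1, 0]])
loss_mismatched = auxiliary_loss([A_plus], [[0, 1]])
```

The loss is small when binding residues attend to real substructures and non-binding residues attend to the pseudo-substructure, and large in the opposite case.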
Overall, the total loss objective for training ArkDTA is mathematically expressed as follows,

$$L = L_1 + \alpha L_2$$

where $L$ is the sum of the two loss objectives and $\alpha$ is the NCI-based auxiliary loss coefficient that determines the intensity of guiding the cross-modal attention mechanism in ArkDTA's ARK-MAB.
The base dimension size for all embeddings is set to $d = 320$, while the number of trainable seed vectors $U$ is 2. ArkDTA and all of its ablations were trained for a maximum of 100 epochs with a batch size of 64 and early stopping. The hyperparameters, including the learning rate and the auxiliary loss coefficient $\alpha$, were determined by prediction performance on the validation instances. We used the Adam optimizer with weight decay 0.0 for training ArkDTA on the binding affinity datasets. The learning rate was set to 0.00005 and 0.0001, while the auxiliary loss coefficient $\alpha$ was set to 5.0 and 1.0 for the KIKD-based (PDBbind KIKD Subset, Davis, Metz) and IC50-based (PDBbind IC50 Subset) datasets, respectively.
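A compact sketch of how the two terms combine is given below. The main term is written as an RMSE, following the description of the main loss objective; the dictionary layout and function names are hypothetical, while the learning rates and coefficients are the values reported above.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error over a batch."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

def total_loss(main_loss, aux_loss, alpha):
    """Weighted sum of the main and auxiliary loss objectives."""
    return main_loss + alpha * aux_loss

# Reported settings, keyed by measurement type (layout is illustrative).
HYPERPARAMS = {
    "KIKD": {"learning_rate": 0.00005, "alpha": 5.0},
    "IC50": {"learning_rate": 0.0001, "alpha": 1.0},
}

# Toy batch: two predictions, and an auxiliary loss value of 0.2.
main = rmse([7.0, 5.0], [6.0, 5.0])
loss = total_loss(main, 0.2, HYPERPARAMS["KIKD"]["alpha"])
```

Setting `alpha` to 0 recovers training on the main objective alone.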

Experiment settings
We trained ArkDTA on each subset (KIKD and IC50) of the PDBbind dataset, which contains the NCI labels necessary for regularizing ArkDTA's drug-target cross-modality attention mechanism. Since the data partition method is based on 5-fold cross-validation, we trained ArkDTA on each fold's training instances and subsequently evaluated its binding affinity prediction performance on the corresponding test instances (Test PDBbind-KIKD). The evaluation metrics used in the experiments are RMSE, mean absolute error (MAE), Pearson's correlation (PCORR), and concordance index (CI). All evaluation metrics were reported as the mean and standard deviation over the five folds.
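The four metrics and the fold-level aggregation can be sketched with the standard library alone. The pairwise definition of the concordance index used here (ties in prediction count as 0.5) and the use of the sample standard deviation are assumptions, since the text does not spell them out.

```python
import math
import statistics

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson(y_true, y_pred):
    mt, mp = statistics.mean(y_true), statistics.mean(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order."""
    concordant, comparable = 0.0, 0
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:
                comparable += 1
                if y_pred[i] > y_pred[j]:
                    concordant += 1.0
                elif y_pred[i] == y_pred[j]:
                    concordant += 0.5
    return concordant / comparable

def summarize(fold_scores):
    """Mean and sample standard deviation of one metric over the folds."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)

# Toy values, invented to exercise the functions.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.5, 5.8, 7.4, 7.9]
```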
In addition, from each fold's test instances we gathered those whose drug compound's scaffold is not present in any drug compound of the training partition. A scaffold is the molecular core structure of a drug compound that determines its overall biochemical activity and is essential in drug design. We extracted each ligand's Bemis-Murcko scaffold via the RDKit library (Bemis and Murcko 1996). If the input ligand in a test data instance has a molecular scaffold that does not overlap with those of any training ligand, we deemed it a test instance with an unseen scaffold. To further evaluate ArkDTA's robustness on compounds with unseen scaffolds, we additionally calculated the evaluation metrics on such data instances (Test PDBbind-Unseen Scaffolds).
We then loaded each fold's model checkpoint to conduct additional experiments on the Davis and Metz datasets. To transfer NCI-related knowledge, we fine-tuned each of the models previously trained on the PDBbind dataset and evaluated them on the same test instances of the Davis and Metz datasets. The evaluation metrics were calculated based on the mean and standard deviation of the five sets of individual model performances. Since only KIKD-based instances are present in the Davis and Metz datasets, we evaluated performance on KIKD-based affinity predictions only. The same experimental setting was applied to ArkDTA's baselines and ablations as well.
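The unseen-scaffold filter reduces to a set membership test once a scaffold string is available for every ligand. The sketch below assumes the scaffolds were computed beforehand (for instance as SMILES strings); the record keys and toy molecules are hypothetical.

```python
def unseen_scaffold_subset(train_scaffolds, test_instances):
    """Keep the test instances whose scaffold never occurs among
    the training ligands' scaffolds."""
    seen = set(train_scaffolds)
    return [inst for inst in test_instances if inst["scaffold"] not in seen]

# Toy data: scaffold strings are assumed to be precomputed.
train_scaffolds = ["c1ccccc1", "c1ccncc1", "c1ccccc1"]
test_instances = [
    {"id": "t1", "scaffold": "c1ccccc1"},      # seen in training
    {"id": "t2", "scaffold": "c1ccc2[nH]ccc2c1"},  # unseen
    {"id": "t3", "scaffold": "c1ccncc1"},      # seen in training
]
unseen = unseen_scaffold_subset(train_scaffolds, test_instances)
```

The evaluation metrics are then recomputed on `unseen` only.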

Baseline models and ArkDTA ablations
The binding affinity prediction models used as baselines are  , and IIFDTI (Cheng et al. 2022). Details on each of the baseline models can be found in Supplementary Table S1. Among those that employed cross-modality attention, only MONN and ArkDTA explicitly utilized the NCI labels to improve this mechanism. While MONN introduced a secondary downstream task for predicting 'atom-residue pairwise' NCIs, our model alternatively used 'residue-wise' NCIs by means of attention regularization since its ligand representation is based on set of chemical substructures.
The model hyperparameters and implementation were imported from each of their original works. Note that some baseline models were originally implemented to predict binary interaction outcomes instead of continuous binding affinity values. To circumvent this issue, we replaced the downstream classifier layers with the regression ones for TransformerCPI, HyperAttentionDTI, and IIFDTI.
To investigate the effects of our model design choices, we made the following model ablations for ArkDTA:
• Remove $L_2$: We removed the auxiliary loss objective $L_2$ by setting the loss coefficient $\alpha$ to 0, which leaves the model trained solely on the main binding affinity prediction task ($L = L_1$). The purpose of this ablation is to probe the effects of attention regularization given the residue-wise NCI labels as ground truths.
• Freeze ESM: We froze the pre-trained weights of the imported ESM model used in ArkDTA's 'Protein Encoder Module'. The purpose of this ablation is to investigate the effects of fine-tuning, which is a common practice in optimizing large-scale language models on other downstream tasks.

[Figure 3 caption (beginning lost in extraction): ... $\{\ldots, q_m\}$ and $\{k_1, k_2, \ldots, k_n, k_p\}$, respectively. The auxiliary loss objective $L_2$ enforces ArkDTA to focus most of the attention toward $k_p$ when given a query residue without NCI. On the other hand, $L_2$ encourages ArkDTA to only distribute its attention on actual chemical substructures from $k_1$ to $k_n$ in an unsupervised fashion.]

Table 2 shows the 5-fold cross-validation results on the PDBbind dataset. For the KIKD subset of the PDBbind dataset, ArkDTA performed slightly better than its ablated version without the NCI-based auxiliary loss but fell behind one of the baseline models, AttentionDTA, in all evaluation metrics, including on the test instances with unseen scaffolds. For the IC50 subset of the PDBbind dataset, ArkDTA showed the best performance among its ablations and the baseline models, except in RMSE and MAE over all test instances. Table 3 shows the additional results on the Davis and Metz datasets. Among the baselines, we selected the top four performing models, DeepDTA, MONN, AttentionDTA, and BACPI, based on their performance on the KIKD subset of the PDBbind dataset. Among the five models, AttentionDTA showed the best overall performance on both the Davis and Metz datasets.

Analysis on attention weights
For qualitative analysis, we performed model inference and visualized the attention weights using heatmaps and compared them with actual binding complexes and ligand compound structures obtained from the Protein Data Bank (PDB) and PubChem database (Burley et al. 2023;Kim et al. 2023).
Given a protein-ligand pair input represented as a set of $m$ residues and $n$ chemical substructures respectively, the attention weights between the $m$ protein residues (queries) and the $n$ chemical substructures appended with a pseudo-substructure (keys) in each head are represented as a 2D matrix $A^{+}$. We calculated the NCI score for each protein residue as shown in Equations (17) and (18). Since proteins are generally long sequences, we transposed the 2D matrix and truncated residue regions where both NCI scores and labels are deemed negative (i.e. no NCI between the residue and ligand). As shown in Fig. 4a, the protein-ligand binding complex (Schrödinger and DeLano 2020) and the molecular structure of the ligand are on the left and right sides, respectively. The attention map displayed in the center is described as follows.
• The $m$ columns correspond to the $m$-sized amino acid sequence of the input protein.
After obtaining the attention maps and related visualizations, we performed three different case studies.

Seen protein & unseen ligand binding complex (4x6n, 3Y5)
In this case study, we selected an input protein contained in both the training and test partitions of the PDBbind dataset and a ligand that is only in its test partition. Figure 4a shows the visualization results for a binding complex structure (4x6n) of factor XIa with the inhibitor 1-{(1S)-1-[4-(3-amino-1H-indazol-6-yl)-5-chloro-1H-imidazol-2-yl]-2-phenylethyl}-3-[5-chloro-2-(1H-tetrazol-1-yl)benzyl]urea (3Y5). Based on a comparison between the calculated residue-wise NCI scores and the ground truth NCI labels, ArkDTA was able to identify not only the NCI-positive residues but also their local regions. The highlighted areas of the protein-ligand binding complex structure also align with the residue sites that appear to bind the incoming ligand.

Unseen protein-ligand binding complex (6n77, KEJ)
In this case study, we selected a protein-ligand pair that is only in the test partition of the PDBbind dataset. Figure 4b shows the visualization results for a binding complex structure of the JAK1 kinase domain (6n77) with the inhibitor ligand N-[3-(5-chloro-2-methoxyphenyl)-1-methyl-1H-pyrazol-4-yl]pyrazolo[1,5-a]pyrimidine-3-carboxamide (KEJ). In this case, despite the test instance being an unseen protein-ligand pair, the protein residues predicted as having NCIs based on their attention scores seemed to form plausible binding pockets for the ligand. Notably, ArkDTA highlighted the cyclic regions containing pyrazole-based substructures (cC(N)cn(C)n), which are renowned as important pharmacological scaffolds (Karrouchi et al. 2018).

Out-of-dataset binding complex (8bq4, QZR)
Figure 4c shows the visualization results for a crystal binding complex structure (8bq4) of the therapeutic target phosphatidylinositol 5-phosphate 4-kinases (PI5P4Ks) with the inhibitor ligand 6-methyl-N-(4-methylsulfonylphenyl)thieno[2,3-d]pyrimidin-4-amine (QZR) (Rooney et al. 2022). We examined the attention map to see whether ArkDTA is able to identify potential binding residues of an out-of-dataset drug-target pair. Interestingly, some of the residues in PI5P4Ks identified as having NCIs were located near the ligand. The surrounding residue sites may act as guidelines for generating grids prior to docking simulation. This suggests that ArkDTA has developed its own understanding of active protein residues due to the NCI-based attention regularization technique. One of the most highlighted substructures in the ligand is the sulfonyl functional group (cc(c)S(C)(=O)=O), which is commonly used in synthesizing drug compounds (Feng et al. 2016). Another highlighted substructure is related to pyrimidine (cc(N)ncn), a therapeutic scaffold which has various biological roles such as antiviral and antimalarial agent (Kumar and Narasimhan 2018).
Figure S4 shows a comparison between ArkDTA and its ablated version that does not employ NCI-based attention regularization. As shown in the ablated version's attention map, all protein residues are treated as having NCIs with the ligand. While this visualization may inform researchers of significant chemical substructures, it has limited explainability on the protein side. This highlights the role of NCIs in attention regularization, which includes providing more salient information about the binding complex, leading to better explainability. (Full-sized attention maps for the three case studies are available in Supplementary Fig. S5. Additional analyses of attention weights extracted from other protein-ligand complexes for each of the case studies are available in Supplementary Fig. S6.)

Effect of attention regularization guided by NCIs
Preliminary statistical analysis shown in Supplementary Fig. S2 indicates that there is no obvious correlation between the ratio of active to inactive residues in a given protein and the binding affinity value of the protein-ligand complex. This suggests that our model's ability to incorporate key domain knowledge such as NCIs in inference may not translate directly to improvements in binding affinity predictive performance. Nonetheless, our model was one of the three highest performing models in all evaluation metrics in the experimental setups reported above, maintaining robust performance while significantly improving model interpretability. The auxiliary loss objective $L_2$ guides our model in identifying residues participating in NCIs with ligand substructures, distributing attention weights in a differentiated manner. The resulting attention maps and weights can be further investigated to gain insights into the potential interaction sites between newly designed candidate drugs and novel target proteins.

Limitations of ArkDTA and future work
A simplifying assumption for the representation of NCIs was that binary values indicating presence or absence of NCIs would provide sufficient information on the underlying chemical system. However, NCIs are typically sub-categorized according to characteristics such as their geometrical configurations, interaction strengths, and the kind of chemical force involved. Future works can take into account how different types of NCIs affect the overall binding behavior as proposed by Choe et al. (2022). Another potential limitation in our work is the restrictive size of the training dataset. PDBbind is unique among publicly available datasets in that it provides coordinate data for each protein-ligand complex, which can be used to obtain NCI markers using tools such as PLIP (Adasme et al. 2021). However, PDBbind suffers a sparsity problem, containing binding affinity values for only a limited number of protein-ligand pairs relative to other benchmark datasets such as Davis and KIBA (Pahikkala et al. 2015;He et al. 2017). In future works, adoption of transfer learning and data augmentation methods can be explored to address this issue.
Despite the promising results from qualitative analysis, ArkDTA's performance in four quantitative evaluation metrics does not fully capture the benefits of our NCI-aware regularization method in terms of model generalizability. We speculate that this is partly due to the absence of a dedicated multi-objective loss function and optimization technique. In future work, we plan to design a multi-objective loss function and optimizer such that the relative weight given to each loss objective can be determined to minimize the potential antagonism between the two loss objectives. In addition, we will investigate how changing the auxiliary loss coefficient $\alpha$ affects the model's performance on the binding affinity prediction task.

Conclusion
In this work, we introduce ArkDTA, a protein-ligand binding affinity prediction model that employs a novel attention regularization technique guided by NCIs. While there is still room for improvement in predictive performance, our model achieves significant improvements in model explainability over existing models. Furthermore, we found upon qualitative analysis of attention maps that the final distribution of attention weights can be used to gain insights into the model's internal understanding of the underlying chemical system, as well as to suggest protein residues and chemical substructures of high pharmaceutical relevance.
The results are based on the mean and standard deviation of the five different models' performance on the same test instances.